RNASeqvsIsoSeq of

## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Loading required package: ggplot2
## Loading required package: magrittr
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine

Objective: To tabulate the number of full-length reads obtained per gene from Isoseq and order genes from high to low, for comparison with RNAseq data for exact sample

Rationale: To evaluate whether Isoseq output comparable to RNAseq output

Analysis: 1. Downloaded raw subread.bam file from Sequel output 2. CCS and Isoseq3 command line (Lima, Cluster, Polish) 3. Mapped to mouse genome using GMAP 4. Tofu Cupcake 5. Sqanti for isoform characterisation

Step 1) IsoSeq Preparation: Annotate2Abundance

Define function for Importing and Merging SQANTI classification file and TOFU abundance file

Input: Sqanti_Filter Classification Output file * All details of HQ-unique isoforms classified by assigning PacBio output gene Cluster ID to mouse gene name

Input: ToFU Abundance Output file * Quantification of number of Full_Length per PacBio_ID

Output: Merged txt file by PacBio ID * Merged txt file has the gene name by which the isoform belongs to (as identified by SQANTI) and the quantification of FL_counts (as quantified in TOFU) by PacBio ID

[1] "Input SQANTI Filter output file for Sample O23"
[1] "SQANTI Classification file of Sample O23"
    isoform chrom strand length exons structural_category associated_gene
1 PB.7291.2  chr6      -    641     3   full-splice_match          Mrps33
  associated_transcript ref_length ref_exons diff_to_TSS diff_to_TTS
1  ENSMUST00000031978.8       1107         3          95         371
  diff_to_gene_TSS diff_to_gene_TTS subcategory RTS_stage all_canonical
1                0               17  multi-exon     FALSE     canonical
  min_sample_cov min_cov min_cov_pos             sd_cov FL n_indels
1              1      80  junction_2               60.5 34        0
  n_indels_junc bite iso_exp gene_exp ratio_exp FSM_class coding
1            NA TRUE      NA       NA        NA         C coding
  ORF_length CDS_length CDS_start CDS_end perc_A_downstream_TTS
1        106        321        57     377                  25.0
  dist_to_cage_peak within_cage_peak polyA_motif polyA_dist
1                 4             True          NA         NA
 [ reached getOption("max.print") -- omitted 5 rows ]
[1] "Input SQANTI Filter output file for Sample O23"
[1] "/gpfs/mrc0/projects/Research_Project-MRC148213/sl693/WholeTranscriptome/Individual/ToFU/O23.collapsed.filtered.abundance.txt"
[1] "Abundance file of Sample O23"
   PacBio_Id count_fl count_nfl count_nfl_amb    norm_fl   norm_nfl
15    PB.9.1        7         7             7 2.3785e-05 2.3785e-05
16    PB.9.2        8         8             8 2.7183e-05 2.7183e-05
17   PB.10.1        2         2             2 6.7957e-06 6.7957e-06
18   PB.10.2        4         4             4 1.3591e-05 1.3591e-05
19   PB.11.1        2         2             2 6.7957e-06 6.7957e-06
20   PB.12.1        6         6             6 2.0387e-05 2.0387e-05
   norm_nfl_amb
15   2.3785e-05
16   2.7183e-05
17   6.7957e-06
18   1.3591e-05
19   6.7957e-06
20   2.0387e-05
[1] "Merged file of SQANTI Classification and Abundance File of Sample O23"
    isoform chrom strand length exons     structural_category
1   PB.10.2  chr1      +   3050     6       full-splice_match
  associated_gene associated_transcript ref_length ref_exons diff_to_TSS
1          Pcmtd1 ENSMUST00000061280.16       5232         6          33
  diff_to_TTS diff_to_gene_TSS diff_to_gene_TTS     subcategory RTS_stage
1        2149               23             -643      multi-exon     FALSE
  all_canonical min_sample_cov min_cov min_cov_pos             sd_cov FL
1     canonical              1       4  junction_1 26.502830037563914  4
  n_indels n_indels_junc bite iso_exp gene_exp ratio_exp FSM_class coding
1        3            NA TRUE      NA       NA        NA         C coding
  ORF_length CDS_length CDS_start CDS_end perc_A_downstream_TTS
1        357       1074       380    1453                  70.0
  dist_to_cage_peak within_cage_peak polyA_motif polyA_dist count_fl
1               -31             True          NA         NA        4
  count_nfl count_nfl_amb    norm_fl   norm_nfl norm_nfl_amb
1         4             4 1.3591e-05 1.3591e-05   1.3591e-05
 [ reached getOption("max.print") -- omitted 5 rows ]

Step 2) IsoSeq Preparation: SumFLCounts

Define function that the FL Counts for all transcripts per gene

Motivation: SQANTI Filter classification outputs one gene with multiple isoforms, thus complicates correlation with RNA-Seq Gene Expression Counts. PacBio FL count is presented per isoform rather than per gene. However, FeatureCount’s output from RNA-seq data is on a gene level. Therefore FL counts from IsoSeq needs to be summed for more convenient comparison: Total FL Counts of Transcripts per Gene from IsoSeq vs Raw Gene Counts from RNASeq

Alternative option: select only isoform with the highest number of FL counts, yet biased results especially given if many isoforms with similar or slgihtly smaller number of FL-counts. Assumptions: RNA-seq captures expression of all RNA transcripts irrespective of isoforms

# A tibble: 6 x 3
  associated_gene PacBio_Isoform PacBio_FL_Counts
  <chr>           <fct>                     <int>
1 0610009B22Rik   PB.1205.1                     7
2 0610010F05Rik   PB.1131.1                     9
3 0610012G03Rik   PB.3467.1                    15
4 0610037L13Rik   PB.6184.1                     8
5 1110004F10Rik   PB.8224.1                    44
6 1110008L16Rik   PB.1963.1                     5

Step 3) RNASeq Preparation

Input: FeatureCounts of all RNASeq samples (STAR Aligned to mm10 genome, and annotated to Gencode Mouse V20 gtf file) at gene level Output: FeatureCount of specific sample

[1] "Input FeatureCount for All Samples"
                            B21 C21 K17 K23 M21 O23 Q21 S23
ENSMUSG00000000001.4_Gnai3  761 565 374 523 582 375 418 410
ENSMUSG00000000003.15_Pbsn    0   0   0   0   0   0   0   0
ENSMUSG00000000028.15_Cdc45  19  20  20  25  32  24  24  24
ENSMUSG00000000031.16_H19     1   2   3   2   0   0  12   0
ENSMUSG00000000037.16_Scml2   7  14  12   6  12   7   4  15
ENSMUSG00000000049.11_Apoh    3   3   1   4   0   0   3   2
[1] "Input FeatureCount for Sample O23"
                            RNASeq O23 Raw Counts Mgi_Symbol
ENSMUSG00000000001.4_Gnai3                    375      Gnai3
ENSMUSG00000000028.15_Cdc45                    24      Cdc45
ENSMUSG00000000037.16_Scml2                     7      Scml2
ENSMUSG00000000056.7_Narf                     774       Narf
ENSMUSG00000000058.6_Cav2                     281       Cav2
ENSMUSG00000000078.7_Klf6                     563       Klf6
[1] "Validation of summing PacBio FL"
[1] "Original input data from ToFU Abundance files for the Gene App"
[1] "Summed PacBio FL count for the Gene App saved as new dataframe for downstream analysis"
[[1]]
     associated_gene count_fl
5023             App        3
5024             App        3
5025             App      104
5026             App        2
5027             App       33
5028             App        3
5029             App        5
5030             App        2
5031             App        4
5032             App        3

[[2]]
# A tibble: 1 x 3
  associated_gene PacBio_Isoform PacBio_FL_Counts
  <chr>           <fct>                     <int>
1 App             PB.3566.1                   162

Step 4) Merge RNASeq and IsoSeq

Input: Sample-specific Isoseq (Dataframe: Merge_IsoSeq_SumFL) and RNASeq (Dataframe: RNASeq) Counts Output: Dataframe “Full_Merge”: Merged Counts across IsoSeq and RNASeq by gene names

Also to call out specific counts of AD-associated genes, created function AD_Counts.

  associated_gene PacBio_Isoform PacBio_FL_Counts RNASeq O23 Raw Counts
1   0610005C13Rik           <NA>               NA                    12
2   0610009B22Rik      PB.1205.1                7                   148
3   0610009E02Rik           <NA>               NA                     9
4   0610009L18Rik           <NA>               NA                    29
5   0610010F05Rik      PB.1131.1                9                   413
6   0610010K14Rik           <NA>               NA                    14
      associated_gene PacBio_Isoform PacBio_FL_Counts
1487             Apoe      PB.7782.1             1881
1494              App      PB.3566.1              162
10177            Mapt      PB.1690.1               42
12935           Psen1      PB.2058.1                7
      RNASeq O23 Raw Counts
1487                  23649
1494                  17898
10177                  6759
12935                   683

Step 5) Data Review for Full_Merge: RNASeq vs IsoSeq

Motivation: Within Full_Merge dataframe, interested to know which genes are detected only by IsoSeq, only by RNASeq, and alone. Also later downstream, able to plot the number of respective counts for these genes.

[1] "Total Number of Genes in Full_Merge of IsoSeq and RNASeq: 17969"
[1] "Total Number of Genes Detected in IsoSeq AND RNASeq: 8774"
[1] "Total Number of Genes Detected in IsoSeq but not RNASeq: 133"
[1] "Total Number of Genes Detected in RNASeq but not IsoSeq: 9062"
[1] "/gpfs/mrc0/projects/Research_Project-MRC148213/sl693/RNASeq/Correlations/O23_Full_Merge.csv"

Step 6) Correlation of RawData

output: Correlation of Gene Expression of IsoSeq FL Counts vs RNASeq Raw Counts. Correlation coefficient calculated from pearson’s method (assuming parametric) and considers

Step 7) Correlation of log(Data)

motivation: As seen above, due to densely populated points of numbers with several extreme values, difficult to see plot. Thus, logged points for visual output: Correlation of Gene Expression of log(10)(IsoSeq FL Counts) vs log(10)(RNASeq Raw Counts). Note, correlation coefficient doesn’t change. However, as it is not possible to log 0, can only consider genes detected in both technology.

Step 7) Missing Reads

input: Genes either detected by IsoSeq or RNASeq from Full_Merge dataframe (IsoSeq and RNASeq Counts/gene) output: Plot of those genes with its respective counts

[1] "Genes with no IsoSeq Reads but RNASeq RawCounts > 5000"
 [1] "Ap2m1"    "Apbb1"    "Dnm1"     "Dst"      "Huwe1"    "Kcnj10"  
 [7] "Mapk8ip3" "Shank1"   "Slc12a5"  "Stum"     "Syngap1"  "Syt7"    
[13] "Unc13a"   "Xist"    
[1] "Genes with only IsoSeq Reads, and no RNASeq Reads"
 [1] "4930509H03Rik" "4930578G10Rik" "A730089K16Rik" "A930015D03Rik"
 [5] "Aarsd1"        "AC110573.1"    "AC121802.1"    "AC124484.1"   
 [9] "AL731706.1"    "B230362B09Rik" "B3gnt2"        "C1qtnf5"      
[13] "Ccdc22"        "D630033A02Rik" "Entpd4"        "Entpd4b"      
[17] "Epo"           "Exosc6"        "Fam177a"       "Fen1"         
[21] "Galnt2"        "Gemin4"        "Gm10108"       "Gm11518"      
[25] "Gm13370"       "Gm14440"       "Gm15972"       "Gm19409"      
[29] "Gm20186"       "Gm20388"       "Gm20427"       "Gm20458"      
[33] "Gm20460"       "Gm20634"       "Gm20662"       "Gm20683"      
[37] "Gm20695"       "Gm21969"       "Gm21974"       "Gm21988"      
[41] "Gm26551"       "Gm26561"       "Gm26668"       "Gm26786"      
[45] "Gm26904"       "Gm27029"       "Gm28052"       "Gm29232"      
[49] "Gm29253"       "Gm3002"        "Gm3448"        "Gm3591"       
[53] "Gm38182"       "Gm42416"       "Gm42418"       "Gm42420"      
[57] "Gm42466"       "Gm42936"       "Gm44321"       "Gm44503"      
[61] "Gm45021"       "Gm45140"       "Gm45153"       "Gm45213"      
[65] "Gm45234"       "Gm45837"       "Gm47580"       "Gm49032"      
[69] "Gm49321"       "Gm49354"       "Gm49358"       "Gpr25"        
[73] "Gstp2"         "Gtsf1"         "H2-Ke6"       
 [ reached getOption("max.print") -- omitted 58 entries ]

Session Info

R version 3.3.1 (2016-06-21)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] ggthemes_4.2.0   gridExtra_2.3    plotly_4.9.0     ggpubr_0.2.1    
 [5] magrittr_1.5     ggplot2_3.2.0    dplyr_0.8.1      stringr_1.4.0   
 [9] rmdformats_0.3.5 knitr_1.23      

loaded via a namespace (and not attached):
 [1] tidyselect_0.2.5   xfun_0.7           purrr_0.3.2       
 [4] colorspace_1.4-1   vctrs_0.1.0        miniUI_0.1.1.1    
 [7] htmltools_0.3.6    viridisLite_0.3.0  yaml_2.2.0        
[10] utf8_1.1.4         rlang_0.3.4        later_0.8.0       
[13] pillar_1.4.1       glue_1.3.1         withr_2.1.2       
[16] RColorBrewer_1.1-2 questionr_0.7.0    munsell_0.5.0     
[19] ggsignif_0.5.0     gtable_0.3.0       htmlwidgets_1.3   
[22] codetools_0.2-14   evaluate_0.14      labeling_0.3      
[25] crosstalk_1.0.0    httpuv_1.5.1       fansi_0.4.0       
[28] highr_0.8          Rcpp_1.0.1         xtable_1.8-4      
[31] backports_1.1.4    promises_1.0.1     scales_1.0.0      
[34] jsonlite_1.6       mime_0.7           digest_0.6.19     
[37] stringi_1.4.3      bookdown_0.11      shiny_1.3.2       
[40] cli_1.1.0          tools_3.3.1        lazyeval_0.2.2    
[43] tibble_2.1.3       crayon_1.3.4       tidyr_0.8.3       
[46] pkgconfig_2.0.2    zeallot_0.1.0      MASS_7.3-51.4     
[49] data.table_1.12.2  assertthat_0.2.1   rmarkdown_1.13    
[52] httr_1.4.0         rstudioapi_0.10    R6_2.4.0          

Szi Kay Leung

Report created: 2019-07-13